Bernoulli distribution (bernoulli)#
The Bernoulli distribution models a single binary outcome: success (1) with probability \(p\) and failure (0) with probability \(1-p\).
Learning goals#
Recognize Bernoulli data and common modeling patterns.
Write the PMF and CDF in closed form.
Compute and interpret mean/variance and other moments.
Derive the likelihood and the MLE for \(p\).
Simulate Bernoulli trials (NumPy-only) and visualize behavior.
Use scipy.stats.bernoulli and scipy.stats.fit.
Prerequisites#
Basic probability (PMF/CDF), expectation, and variance
Comfort with logs and derivatives
Notebook roadmap#
Title & Classification
Intuition & Motivation
Formal Definition
Moments & Properties
Parameter Interpretation
Derivations (Expectation, Variance, Likelihood)
Sampling & Simulation (NumPy-only)
Visualization (PMF, CDF, Monte Carlo)
SciPy Integration
Statistical Use Cases
Pitfalls
Summary
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from scipy import stats
from scipy.special import xlogy, xlog1py
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
SEED = 7
rng = np.random.default_rng(SEED)
import sys
import scipy
import plotly
print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("SciPy:", scipy.__version__)
print("Plotly:", plotly.__version__)
print("Seed:", SEED)
Python: 3.12.9
NumPy: 1.26.2
SciPy: 1.15.0
Plotly: 6.5.2
Seed: 7
1) Title & Classification#
Name: bernoulli (Bernoulli distribution)
Type: Discrete
Support: \(x \in \{0, 1\}\)
Parameter space: \(p \in [0, 1]\) (often \(p \in (0,1)\) for a non-degenerate distribution)
Notation: \(X \sim \mathrm{Bernoulli}(p)\) or \(X \sim \mathrm{Ber}(p)\).
You will also see the parameter written as \(\theta\) instead of \(p\).
2) Intuition & Motivation#
A Bernoulli random variable is the mathematical abstraction of a single yes/no trial.
What it models: a single occurrence/non-occurrence event.
Examples: click/no-click, fraud/not, defect/not, recovered/not at a fixed time, coin toss.
Typical use cases:
Binary labels in classification (Bernoulli likelihood underlying logistic regression).
A/B testing and conversion-rate estimation.
Random masks (e.g., dropout in neural networks).
Reliability: component works/fails.
Relations to other distributions#
Binomial: if \(X_1,\dots,X_n\) are i.i.d. Bernoulli\((p)\), then \(\sum_{i=1}^n X_i \sim \mathrm{Binomial}(n,p)\) (a quick check follows this list).
Categorical: Bernoulli is a categorical distribution with two categories.
Beta–Bernoulli: a Beta prior on \(p\) is conjugate and yields a closed-form posterior.
Geometric: the number of trials until the first success in repeated Bernoulli trials is geometric.
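As a quick sanity check of the Binomial relation above, here is a minimal sketch that sums i.i.d. Bernoulli draws and compares the empirical mean and variance with \(np\) and \(np(1-p)\). It uses a separate generator so the notebook's main rng stream is untouched.
# Sum of i.i.d. Bernoulli(p) draws vs. Binomial(n, p) moments (separate RNG on purpose).
rng_check = np.random.default_rng(123)
p_rel, n_rel, reps = 0.3, 12, 50_000
sums = (rng_check.random((reps, n_rel)) < p_rel).sum(axis=1)
print("empirical mean:", sums.mean(), "| n*p =", n_rel * p_rel)
print("empirical var :", sums.var(), "| n*p*(1-p) =", n_rel * p_rel * (1 - p_rel))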
3) Formal Definition#
Let \(X \sim \mathrm{Bernoulli}(p)\) with \(p \in [0,1]\).
PMF#
A compact form (valid for \(x \in \{0,1\}\)) is:
$$
\mathbb{P}(X=x) = p^x (1-p)^{1-x}.
$$
CDF#
Because this is a discrete distribution, the CDF is a step function:
$$
F(x)=\mathbb{P}(X \le x)=
\begin{cases}
0 & x < 0 \\
1-p & 0 \le x < 1 \\
1 & x \ge 1.
\end{cases}
$$
Indicator view#
If \(A\) is an event with \(\mathbb{P}(A)=p\), then the indicator \(X=\mathbf{1}\{A\}\) is Bernoulli\((p)\).
def validate_p(p: float) -> float:
p = float(p)
if not (0.0 <= p <= 1.0):
raise ValueError(f"p must be in [0, 1], got {p!r}")
return p
def bernoulli_pmf(x, p: float):
"""PMF of Bernoulli(p) evaluated at x."""
p = validate_p(p)
x = np.asarray(x)
return np.where(x == 1, p, np.where(x == 0, 1.0 - p, 0.0)).astype(float)
def bernoulli_cdf(x, p: float):
"""CDF of Bernoulli(p) evaluated at x (step function)."""
p = validate_p(p)
x = np.asarray(x)
return np.where(x < 0, 0.0, np.where(x < 1, 1.0 - p, 1.0)).astype(float)
p_demo = 0.3
x_demo = np.array([-1, 0, 1, 2])
print("x:", x_demo)
print("pmf:", bernoulli_pmf(x_demo, p_demo))
print("cdf:", bernoulli_cdf(x_demo, p_demo))
x: [-1 0 1 2]
pmf: [0. 0.7 0.3 0. ]
cdf: [0. 0.7 1. 1. ]
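As a quick cross-check (a sketch; scipy.stats.bernoulli is introduced properly in Section 9), SciPy agrees with the helpers above on the same demo grid:
# Cross-check the NumPy helpers against scipy.stats.bernoulli.
print("scipy pmf:", stats.bernoulli.pmf(x_demo, p_demo))
print("scipy cdf:", stats.bernoulli.cdf(x_demo, p_demo))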
4) Moments & Properties#
For \(X \sim \mathrm{Bernoulli}(p)\), define \(q = 1-p\).
Mean: \(\mathbb{E}[X] = p\)
Variance: \(\mathrm{Var}(X) = pq\)
Skewness (for \(p \in (0,1)\)):
$$
\gamma_1 = \frac{\mathbb{E}[(X-\mu)^3]}{\sigma^3} = \frac{1-2p}{\sqrt{pq}}
$$
Kurtosis:
Central 4th moment: \(\mu_4 = pq(1-3pq)\)
Excess kurtosis (kurtosis minus 3):
$$
\gamma_2 = \frac{\mu_4}{(pq)^2} - 3 = \frac{1}{pq} - 6
$$
(Undefined at \(p \in \{0,1\}\) because \(\sigma^2=0\).)
MGF and characteristic function#
MGF:
$$
M_X(t)=\mathbb{E}[e^{tX}] = q + p e^t = 1-p + p e^t
$$
Characteristic function:
$$
\varphi_X(t)=\mathbb{E}[e^{itX}] = 1-p + p e^{it}
$$
Entropy#
Using natural logarithms (nats):
$$
H(X) = -p\log p - q \log q.
$$
This is maximized at \(p=1/2\) and equals 0 at \(p=0\) or \(p=1\).
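A quick vectorized check of that claim (a minimal sketch using the xlogy/xlog1py helpers imported in the setup cell): the entropy peaks at \(p=1/2\) with value \(\log 2\) and is 0 at the boundaries.
# Entropy of Bernoulli(p) over a grid: maximum at p = 1/2, zero at p = 0 and p = 1.
p_grid = np.linspace(0.0, 1.0, 101)
h_grid = -(xlogy(p_grid, p_grid) + xlog1py(1.0 - p_grid, -p_grid))
print("argmax p:", p_grid[np.argmax(h_grid)], "| max entropy:", h_grid.max(), "| log 2:", np.log(2))
print("entropy at p=0 and p=1:", h_grid[0], h_grid[-1])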
def bernoulli_mean(p: float) -> float:
return validate_p(p)
def bernoulli_var(p: float) -> float:
p = validate_p(p)
return p * (1.0 - p)
def bernoulli_skewness(p: float) -> float:
p = validate_p(p)
v = bernoulli_var(p)
if v == 0.0:
return np.nan
return (1.0 - 2.0 * p) / np.sqrt(v)
def bernoulli_excess_kurtosis(p: float) -> float:
p = validate_p(p)
v = bernoulli_var(p)
if v == 0.0:
return np.nan
return 1.0 / v - 6.0
def bernoulli_mgf(t, p: float):
p = validate_p(p)
t = np.asarray(t)
return (1.0 - p) + p * np.exp(t)
def bernoulli_cf(t, p: float):
p = validate_p(p)
t = np.asarray(t)
return (1.0 - p) + p * np.exp(1j * t)
def bernoulli_entropy(p: float, base=np.e):
p = validate_p(p)
p_arr = np.asarray(p, dtype=float)
h = -(xlogy(p_arr, p_arr) + xlog1py(1.0 - p_arr, -p_arr)) # 0*log(0) -> 0
if base == 2:
return h / np.log(2)
if base != np.e:
return h / np.log(base)
return h
# Monte Carlo sanity checks
p = 0.27
n = 200_000
x = (rng.random(n) < p).astype(int)
mu_hat = x.mean()
var_hat = x.var(ddof=0)
mu = bernoulli_mean(p)
var = bernoulli_var(p)
mu3 = ((x - mu_hat) ** 3).mean()
mu4 = ((x - mu_hat) ** 4).mean()
skew_hat = mu3 / (var_hat ** 1.5)
excess_kurt_hat = mu4 / (var_hat**2) - 3.0
t = 1.2
mgf_hat = np.exp(t * x).mean()
cf_hat = np.exp(1j * t * x).mean()
print(f"p = {p}")
print(f"E[X] theory={mu:.4f} | MC={mu_hat:.4f}")
print(f"Var(X) theory={var:.4f} | MC={var_hat:.4f}")
print(f"skewness theory={bernoulli_skewness(p):.4f} | MC={skew_hat:.4f}")
print(f"excess kurtosis theory={bernoulli_excess_kurtosis(p):.4f} | MC={excess_kurt_hat:.4f}")
print(f"MGF(t) theory={bernoulli_mgf(t, p):.4f} | MC={mgf_hat:.4f}")
print(f"CF(t) theory={bernoulli_cf(t, p):.4f} | MC={cf_hat:.4f}")
print(f"Entropy (nats) theory={bernoulli_entropy(p):.4f}")
p = 0.27
E[X] theory=0.2700 | MC=0.2696
Var(X) theory=0.1971 | MC=0.1969
skewness theory=1.0361 | MC=1.0385
excess kurtosis theory=-0.9264 | MC=-0.9215
MGF(t) theory=1.6264 | MC=1.6255
CF(t) theory=0.8278+0.2517j | MC=0.8281+0.2513j
Entropy (nats) theory=0.5833
5) Parameter Interpretation#
The single parameter \(p\) is the probability of observing a 1 (“success”).
As \(p \to 0\), almost all mass is at 0.
As \(p \to 1\), almost all mass is at 1.
At \(p = 1/2\), the distribution is balanced.
A common reparameterization is the log-odds:
$$
\operatorname{logit}(p) = \log\frac{p}{1-p},
$$
which is the natural parameter in logistic regression.
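A minimal sketch of the round trip, using scipy.special.logit and its inverse scipy.special.expit (the logistic sigmoid):
from scipy.special import logit, expit  # expit is the inverse of logit (the sigmoid)
p_vals = np.array([0.1, 0.5, 0.9])
eta = logit(p_vals)                 # log-odds (natural parameter)
print("p           :", p_vals)
print("logit(p)    :", eta)
print("expit(logit):", expit(eta))  # recovers p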
6) Derivations#
Expectation#
Because \(X\in\{0,1\}\):
$$
\mathbb{E}[X] = 0\cdot(1-p) + 1\cdot p = p.
$$
Variance#
Use \(\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\) and note \(X^2 = X\) for \(X\in\{0,1\}\):
$$
\mathbb{E}[X^2] = \mathbb{E}[X] = p
\quad\Rightarrow\quad
\mathrm{Var}(X)=p-p^2=p(1-p).
$$
Likelihood (i.i.d. sample)#
Let \(x_1,\dots,x_n \in \{0,1\}\) be i.i.d. Bernoulli\((p)\). The likelihood is
$$
L(p; x_{1:n}) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i}.
$$
Let \(k=\sum_i x_i\) be the number of successes. Then
$$
L(p; x_{1:n}) = p^k (1-p)^{n-k}.
$$
The log-likelihood is
$$
\ell(p) = k \log p + (n-k)\log(1-p).
$$
Differentiate and set to zero:
$$
\ell'(p)=\frac{k}{p}-\frac{n-k}{1-p}=0
\quad\Rightarrow\quad
\hat{p}_{\text{MLE}}=\frac{k}{n}=\bar{x}.
$$
The second derivative \(\ell''(p)=-k/p^2-(n-k)/(1-p)^2<0\) for \(p\in(0,1)\), so this is a maximum.
def bernoulli_log_likelihood(p, x) -> float:
"""Bernoulli log-likelihood for i.i.d. data x in {0,1}."""
p = np.asarray(p, dtype=float)
x = np.asarray(x)
if not np.all((x == 0) | (x == 1)):
raise ValueError("x must contain only 0/1 values")
k = x.sum()
n = x.size
eps = 1e-12
p = np.clip(p, eps, 1.0 - eps)
return k * np.log(p) + (n - k) * np.log1p(-p)
# Visualize the log-likelihood and the MLE
p_true = 0.35
n = 40
x = (rng.random(n) < p_true).astype(int)
k = int(x.sum())
p_hat = k / n
grid = np.linspace(1e-4, 1.0 - 1e-4, 600)
ll = bernoulli_log_likelihood(grid, x)
fig = go.Figure()
fig.add_trace(go.Scatter(x=grid, y=ll, mode="lines", name="log-likelihood"))
fig.add_vline(x=p_true, line_dash="dash", line_color="gray", annotation_text="p_true")
fig.add_vline(x=p_hat, line_dash="dash", line_color="red", annotation_text="p_hat (MLE)")
fig.update_layout(
title=f"Bernoulli log-likelihood (n={n}, k={k})",
xaxis_title="p",
yaxis_title="log L(p)",
)
fig.show()
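As a numerical cross-check (a minimal sketch using scipy.optimize.minimize_scalar, which is not otherwise used in this notebook), maximize the log-likelihood directly and compare with the closed-form MLE \(k/n\):
from scipy.optimize import minimize_scalar
# Minimize the negative log-likelihood over p in (0, 1).
res = minimize_scalar(
    lambda q: -bernoulli_log_likelihood(q, x),
    bounds=(1e-6, 1 - 1e-6),
    method="bounded",
)
print("closed-form MLE k/n :", p_hat)
print("numerical maximizer :", res.x)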
7) Sampling & Simulation (NumPy-only)#
A simple sampler uses inverse transform sampling with a uniform random variable:
Draw \(U \sim \mathrm{Uniform}(0,1)\).
Return \(X = 1\) if \(U < p\), else return \(X = 0\).
This works because \(\mathbb{P}(U < p)=p\).
Vectorized implementation:
Generate U = rng.random(size)
Compute (U < p).astype(int)
def sample_bernoulli_numpy(p: float, size: int, rng: np.random.Generator) -> np.ndarray:
p = validate_p(p)
u = rng.random(size)
return (u < p).astype(int)
p = 0.4
n = 5_000
x = sample_bernoulli_numpy(p, n, rng=rng)
# Running mean to illustrate the law of large numbers
running_mean = np.cumsum(x) / (np.arange(n) + 1)
fig = go.Figure()
fig.add_trace(go.Scatter(y=running_mean, mode="lines", name="running mean"))
fig.add_hline(y=p, line_dash="dash", line_color="red", annotation_text="p")
fig.update_layout(
title="Law of Large Numbers: running mean of Bernoulli samples",
xaxis_title="n",
yaxis_title="cumulative mean",
)
fig.show()
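NumPy's Generator also has a built-in equivalent: rng.binomial with n=1 returns Bernoulli draws. A minimal comparison sketch (again with a separate generator, to leave the main rng stream untouched):
# Bernoulli(p) is Binomial(1, p): compare the two samplers' empirical means.
rng_alt = np.random.default_rng(2024)
x_builtin = rng_alt.binomial(n=1, p=p, size=n)
print("inverse-transform sampler mean :", x.mean())
print("rng.binomial(1, p) sampler mean:", x_builtin.mean())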
8) Visualization#
We’ll visualize:
the PMF for several \(p\) values
the CDF (step function)
Monte Carlo samples: the empirical PMF compared to the theoretical PMF
p_values = [0.1, 0.5, 0.9]
# PMF
fig_pmf = go.Figure()
for p in p_values:
fig_pmf.add_trace(
go.Bar(
name=f"p={p}",
x=[0, 1],
y=[1 - p, p],
)
)
fig_pmf.update_layout(
title="Bernoulli PMF",
xaxis_title="x",
yaxis_title="P(X = x)",
barmode="group",
)
fig_pmf.show()
# CDF
x_grid = np.linspace(-0.5, 1.5, 400)
fig_cdf = go.Figure()
for p in p_values:
fig_cdf.add_trace(
go.Scatter(
name=f"p={p}",
x=x_grid,
y=bernoulli_cdf(x_grid, p),
mode="lines",
line_shape="hv",
)
)
fig_cdf.update_layout(
title="Bernoulli CDF (step function)",
xaxis_title="x",
yaxis_title="F(x)",
)
fig_cdf.show()
# Monte Carlo: empirical PMF vs theoretical PMF
p = 0.3
n = 10_000
x = sample_bernoulli_numpy(p, n, rng=rng)
emp = np.array([(x == 0).mean(), (x == 1).mean()])
theo = np.array([1 - p, p])
fig_mc = go.Figure()
fig_mc.add_trace(go.Bar(name="empirical", x=[0, 1], y=emp))
fig_mc.add_trace(go.Bar(name="theoretical", x=[0, 1], y=theo))
fig_mc.update_layout(
title=f"Empirical vs theoretical PMF (p={p}, n={n})",
xaxis_title="x",
yaxis_title="probability",
barmode="group",
)
fig_mc.show()
9) SciPy Integration#
SciPy provides the Bernoulli distribution as scipy.stats.bernoulli.
PMF / CDF: stats.bernoulli.pmf, stats.bernoulli.cdf
Sampling: stats.bernoulli.rvs
Fitting: use scipy.stats.fit(dist, data, ...) (many discrete rv_discrete objects don't expose a .fit method)
bernoulli = stats.bernoulli
p = 0.25
print("pmf(0), pmf(1):", bernoulli.pmf([0, 1], p))
print("cdf(0), cdf(1):", bernoulli.cdf([0, 1], p))
samples = bernoulli.rvs(p, size=10, random_state=rng)
print("rvs:", samples)
# Fit p with scipy.stats.fit
data = bernoulli.rvs(0.32, size=2_000, random_state=rng)
fit_res = stats.fit(bernoulli, data) # MLE by default
print(fit_res)
print("p_hat:", fit_res.params.p)
pmf(0), pmf(1): [0.75 0.25]
cdf(0), cdf(1): [0.75 1. ]
rvs: [0 0 0 1 0 0 0 1 0 0]
params: FitParams(p=0.31299999784461363, loc=0.0)
success: True
message: 'Optimization terminated successfully.'
p_hat: 0.31299999784461363
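SciPy can also report the distribution's moments directly; a minimal sketch comparing stats.bernoulli.stats (mean, variance, skewness, excess kurtosis) with the closed-form helpers from Section 4:
# SciPy's moments vs. the closed-form helpers defined earlier.
p_chk = 0.27
m, v, s, k_ex = stats.bernoulli.stats(p_chk, moments="mvsk")
print("scipy:", float(m), float(v), float(s), float(k_ex))
print("ours :", bernoulli_mean(p_chk), bernoulli_var(p_chk),
      bernoulli_skewness(p_chk), bernoulli_excess_kurtosis(p_chk))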
10) Statistical Use Cases#
Hypothesis testing (proportion / coin fairness)#
Testing a Bernoulli parameter typically reduces to testing a binomial count \(K=\sum_i X_i\).
SciPy’s stats.binomtest provides an exact test for \(H_0: p=p_0\).
Bayesian modeling (Beta–Bernoulli)#
With a Beta prior \(p \sim \mathrm{Beta}(\alpha,\beta)\) and data \(x_{1:n}\), the posterior is:
$$
p \mid x_{1:n} \sim \mathrm{Beta}(\alpha + k,\ \beta + n-k),
$$
where \(k=\sum_i x_i\).
Generative modeling#
Bernoulli likelihood is a workhorse for binary observations:
logistic regression / GLMs: \(p_i = \sigma(\eta_i)\), \(x_i\sim \mathrm{Bernoulli}(p_i)\)
naïve Bayes with binary features (BernoulliNB)
neural nets for binary outputs; binary cross-entropy is the negative Bernoulli log-likelihood (see the sketch after this list)
dropout masks are Bernoulli draws applied to activations
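A minimal check of the cross-entropy point above, with hypothetical logits and labels and the logistic sigmoid from scipy.special.expit: the average binary cross-entropy equals the negative mean Bernoulli log-likelihood.
from scipy.special import expit
eta = np.array([-1.2, 0.3, 2.0, -0.5])   # hypothetical logits
y = np.array([0, 1, 1, 0])               # hypothetical binary labels
probs = expit(eta)                       # p_i = sigmoid(eta_i)
bce = -np.mean(y * np.log(probs) + (1 - y) * np.log1p(-probs))
neg_ll = -np.mean([bernoulli_log_likelihood(pi, [yi]) for pi, yi in zip(probs, y)])
print("binary cross-entropy         :", bce)
print("negative mean log-likelihood :", neg_ll)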
# Hypothesis testing: is a coin fair?
n = 20
k = 15
p0 = 0.5
test = stats.binomtest(k=k, n=n, p=p0, alternative="two-sided")
ci = test.proportion_ci(confidence_level=0.95)
print(test)
print("95% CI for p:", (ci.low, ci.high))
# Bayesian modeling: Beta prior + Bernoulli data
alpha, beta = 2.0, 2.0 # prior strength and prior mean = 0.5
alpha_post = alpha + k
beta_post = beta + (n - k)
posterior_mean = alpha_post / (alpha_post + beta_post)
print("Posterior mean of p:", posterior_mean)
x_grid = np.linspace(0, 1, 400)
prior_pdf = stats.beta.pdf(x_grid, alpha, beta)
post_pdf = stats.beta.pdf(x_grid, alpha_post, beta_post)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_grid, y=prior_pdf, mode="lines", name=f"prior Beta({alpha:.0f},{beta:.0f})"))
fig.add_trace(
go.Scatter(
x=x_grid,
y=post_pdf,
mode="lines",
name=f"posterior Beta({alpha_post:.0f},{beta_post:.0f})",
)
)
fig.update_layout(
title="Beta–Bernoulli: prior vs posterior for p",
xaxis_title="p",
yaxis_title="density",
)
fig.show()
BinomTestResult(k=15, n=20, alternative='two-sided', statistic=0.75, pvalue=0.04138946533203125)
95% CI for p: (0.5089541282920425, 0.9134285308985655)
Posterior mean of p: 0.7083333333333334
11) Pitfalls#
Invalid parameters: \(p\notin[0,1]\) is not a valid Bernoulli distribution.
Degenerate boundaries: at \(p=0\) or \(p=1\), variance is 0 and quantities like skewness/kurtosis are undefined.
Log-likelihood at the boundaries: log(p) or log(1-p) can produce -inf.
Use np.log1p(-p) for stability when \(p\) is close to 1.
When optimizing numerically, clip: p = np.clip(p, eps, 1-eps).
Data validation: fitting assumes observations are exactly 0/1 (or booleans). If you have probabilities or varying success rates, you want a different model (e.g., Bernoulli with varying \(p_i\) or a Beta-Binomial).
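A minimal illustration of the data-validation point (a sketch reusing bernoulli_log_likelihood from Section 6): non-binary observations should be rejected rather than silently fit.
# Non-binary "observations" should raise rather than produce a misleading fit.
bad_data = np.array([0, 1, 0.5, 1])   # 0.5 is not a valid Bernoulli outcome
try:
    bernoulli_log_likelihood(0.4, bad_data)
except ValueError as err:
    print("ValueError:", err)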
# Numerical edge cases: log-likelihood at p=0 or p=1
x = np.array([0, 1, 1, 0, 1])
for p in [0.0, 1e-12, 0.5, 1 - 1e-12, 1.0]:
ll = bernoulli_log_likelihood(p, x)
print(f"p={p: .12f} -> log-likelihood={ll: .3f}")
p= 0.000000000000 -> log-likelihood=-82.893
p= 0.000000000001 -> log-likelihood=-82.893
p= 0.500000000000 -> log-likelihood=-3.466
p= 0.999999999999 -> log-likelihood=-55.262
p= 1.000000000000 -> log-likelihood=-55.262
12) Summary#
Bernoulli is the canonical binary distribution: \(X\in\{0,1\}\) and \(\mathbb{P}(X=1)=p\).
PMF: \(\mathbb{P}(X=x)=p^x(1-p)^{1-x}\) for \(x\in\{0,1\}\); the CDF is a step function.
Mean/variance: \(\mathbb{E}[X]=p\), \(\mathrm{Var}(X)=p(1-p)\).
Likelihood for i.i.d. data yields the MLE \(\hat{p}=\bar{x}\).
Sum of Bernoullis gives Binomial; Beta prior gives conjugate Beta posterior.